
    Normalization of oligonucleotide arrays based on the least-variant set of genes

    BACKGROUND: It is well known that the normalization step of microarray data makes a difference in the downstream analysis. All normalization methods rely on certain assumptions, so differences in results can be traced to different sensitivities to violation of those assumptions. Illustrating the lack of robustness, in a striking spike-in experiment all existing normalization methods fail because of an imbalance between up- and down-regulated genes. This means it is still important to develop a normalization method that is robust against violations of the standard assumptions. RESULTS: We develop a new algorithm based on identification of the least-variant set (LVS) of genes across the arrays. The array-to-array variation is evaluated in a robust linear model fit of pre-normalized probe-level data. These genes are then used as a reference set for a non-linear normalization. The method is applicable to any existing expression summary, such as MAS5 or RMA. CONCLUSION: We show that LVS normalization outperforms other normalization methods when the standard assumptions are not satisfied. In the complex spike-in study, LVS performs similarly to the ideal (in practice unknown) housekeeping-gene normalization. An R package called lvs is available at http://www.meb.ki.se/~yudpaw.
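
    A minimal sketch of the normalization idea, assuming a genes-by-arrays matrix of pre-computed log2 expression summaries (e.g. RMA). The published method ranks genes by array-to-array variation from a robust linear model fit of probe-level data and is distributed as the R package lvs; in the Python sketch below that ranking is approximated by a simple median absolute deviation, and the lvs_fraction and lowess span are illustrative parameters rather than the authors' defaults.

```python
# Minimal sketch of an LVS-style normalization.  The MAD-based ranking and
# the parameter defaults are assumptions for illustration only; the actual
# method derives the least-variant set from a robust linear model fit of
# probe-level data.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def lvs_normalize(expr, lvs_fraction=0.4, frac=0.5):
    """expr: genes x arrays matrix of log2 expression values."""
    expr = np.asarray(expr, dtype=float)
    ref = np.median(expr, axis=1)                        # pseudo-reference array
    dev = expr - ref[:, None]                            # deviation from reference
    mad = np.median(np.abs(dev - np.median(dev, axis=1, keepdims=True)), axis=1)
    lvs_idx = np.argsort(mad)[: int(lvs_fraction * expr.shape[0])]  # least-variant set

    normalized = expr.copy()
    for j in range(expr.shape[1]):
        # Fit the array-versus-reference curve on the LVS genes only, then
        # remove the estimated non-linear bias for every gene on the array.
        fit = lowess(expr[lvs_idx, j], ref[lvs_idx], frac=frac)      # sorted (x, yhat)
        bias = np.interp(ref, fit[:, 0], fit[:, 1]) - ref
        normalized[:, j] = expr[:, j] - bias
    return normalized
```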

    Case-cohort Methods for Survival Data on Families from Routine Registers

    In the Nordic countries, there exist several registers containing information on diseases and risk factors for millions of individuals. This information can be linked into families by use of personal identification numbers, and it represents a great opportunity for studying diseases that show familial aggregation. Due to the size of the registers, it is difficult to analyze the data using traditional methods for multivariate survival analysis, such as frailty or copula models. Since the size of the cohort is known, case-cohort methods based on pseudo-likelihoods are suitable for analyzing the data. We present methods for sampling control families both with and without replacement, and with or without stratification. The data are stratified according to family size and covariate values. Depending on the sampling method, results from simulations indicate that one only needs to sample 1%-5% of the control families in order to achieve good efficiency compared with a traditional cohort analysis. We also provide an application to survival data from the Medical Birth Registry of Norway.
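
    The stratified sampling step can be sketched briefly. The column names, the 5% default fraction, and the inverse sampling-fraction weights below are assumptions for this illustration; the pseudo-likelihood estimation itself is not shown.

```python
# Illustrative stratified sampling of control families for a case-cohort
# design.  Column names ('family_id', 'is_case_family', 'stratum') and the
# default 5% fraction are assumptions for this sketch only.
import pandas as pd

def sample_control_families(families, frac=0.05, replace=False, seed=1):
    """families: one row per family; 'stratum' encodes family size and covariates."""
    cases = families[families["is_case_family"]]
    controls = families[~families["is_case_family"]]
    sampled = (controls.groupby("stratum", group_keys=False)
                       .apply(lambda g: g.sample(frac=frac, replace=replace,
                                                 random_state=seed)))
    # Inverse sampling-fraction weights for a pseudo-likelihood analysis:
    # case families keep weight 1, sampled control families get 1/frac.
    return pd.concat([cases.assign(weight=1.0),
                      sampled.assign(weight=1.0 / frac)], ignore_index=True)
```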

    Detecting differential expression in microarray data: comparison of optimal procedures

    BACKGROUND: Many procedures for finding differentially expressed genes in microarray data are based on classical or modified t-statistics. Due to multiple testing considerations, the false discovery rate (FDR) is the key tool for assessing the significance of these test statistics. Two recent papers have generalized two aspects: Storey et al. (2005) introduced a likelihood ratio test statistic for two-sample situations that has desirable theoretical properties (the optimal discovery procedure, ODP) but uses standard FDR assessment; Ploner et al. (2006) introduced a multivariate local FDR (fdr2d) that allows incorporation of standard error information but uses the standard t-statistic. The relationship and relative performance of these methods in two-sample comparisons is currently unknown. METHODS: Using simulated and real datasets, we compare the ODP and fdr2d procedures. We also introduce a new procedure called S2d that combines the ODP test statistic with the extended FDR assessment of fdr2d. RESULTS: For both simulated and real datasets, fdr2d performs better than ODP. As expected, both methods perform better than a standard t-statistic with a standard local FDR. The new procedure S2d performs as well as fdr2d on simulated data, but performs better on the real datasets. CONCLUSION: The ODP can be improved by including standard error information as in fdr2d. This means that the optimality enjoyed in theory by ODP does not hold for the estimated version that has to be used in practice. The new procedure S2d has a slight advantage over fdr2d, which has to be balanced against a significantly higher computational effort and a less intuitive test statistic.
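
    As a rough illustration of the two-dimensional local FDR idea that fdr2d and S2d build on, the sketch below pairs each gene's t-statistic with its log standard error and estimates fdr as a density ratio against a label-permutation null. The Gaussian kernel smoother and the fixed pi0 are simplifications for this example, not the estimators used by the actual procedures.

```python
# Rough sketch of a two-dimensional local FDR: each gene contributes a pair
# (t-statistic, log standard error), the null is built by permuting sample
# labels, and fdr is a ratio of smoothed densities.  The kernel smoother and
# the fixed pi0 below are simplifications chosen for this illustration.
import numpy as np
from scipy.stats import gaussian_kde

def fdr2d_sketch(x, y, n_perm=50, pi0=1.0, seed=0):
    """x, y: genes x samples expression matrices for the two groups."""
    rng = np.random.default_rng(seed)

    def stat_pairs(a, b):
        se = np.sqrt(a.var(1, ddof=1) / a.shape[1] + b.var(1, ddof=1) / b.shape[1])
        return np.vstack([(a.mean(1) - b.mean(1)) / se, np.log(se)])

    obs = stat_pairs(x, y)                       # 2 x genes observed statistics
    data, n1 = np.hstack([x, y]), x.shape[1]
    null = []
    for _ in range(n_perm):                      # permutation null distribution
        idx = rng.permutation(data.shape[1])
        null.append(stat_pairs(data[:, idx[:n1]], data[:, idx[n1:]]))
    null = np.hstack(null)

    f0 = gaussian_kde(null)(obs)                 # null density at observed pairs
    f = gaussian_kde(obs)(obs)                   # overall density at observed pairs
    return np.clip(pi0 * f0 / f, 0.0, 1.0)       # two-dimensional local FDR
```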

    Super-sparse principal component analyses for high-throughput genomic data

    BACKGROUND: Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true loading vector; for example, for gene expression data we biologically expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced to reduce the number of nonzero coefficients, but the existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients. RESULTS: Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset that contains 21,225 genes. CONCLUSIONS: The new method has better performance than several existing methods, particularly in the estimation of the loading vectors.
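
    The modified-NIPALS idea can be pictured with a generic thresholded power iteration. The relative soft-threshold below is a stand-in for the paper's random-effect penalty and singular-value shrinkage, which are not reproduced here; the function name and parameters are illustrative.

```python
# Generic sketch of a sparse leading principal component via a thresholded
# NIPALS-style iteration.  The relative soft-threshold stands in for the
# random-effect penalty and singular-value shrinkage of the actual method.
import numpy as np

def sparse_pc1(X, threshold=0.1, n_iter=200, tol=1e-8, seed=0):
    """X: samples x genes data matrix.  Returns (scores, sparse loadings)."""
    X = X - X.mean(0)                             # column-center
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = X @ v                                 # update scores given loadings
        w = X.T @ u                               # update loadings given scores
        w = np.sign(w) * np.maximum(np.abs(w) - threshold * np.abs(w).max(), 0.0)
        norm = np.linalg.norm(w)
        if norm == 0.0:                           # threshold removed everything
            break
        w /= norm
        if np.linalg.norm(w - v) < tol:           # converged
            v = w
            break
        v = w
    return X @ v, v
```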

    Correlation test to assess low-level processing of high-density oligonucleotide microarray data

    BACKGROUND: There are currently a number of competing techniques for low-level processing of oligonucleotide array data. The choice of technique has a profound effect on subsequent statistical analyses, but there is no method to assess whether a particular technique is appropriate for a specific data set without reference to external data. RESULTS: We analyzed coregulation between genes in order to detect insufficient normalization between arrays, where coregulation is measured in terms of statistical correlation. In a large collection of genes, a random pair of genes should have on average zero correlation, hence allowing a correlation test. For all data sets that we evaluated and for the three most commonly used low-level processing procedures (MAS5, RMA and MBEI), the housekeeping-gene normalization failed the test. For a real clinical data set, RMA and MBEI showed significant correlation for absent genes. We also found that a second round of normalization on the probe-set level improved normalization significantly throughout. CONCLUSION: Previous evaluation of low-level processing in the literature has been limited to artificial spike-in and mixture data sets. In the absence of a known gold standard, the correlation criterion allows us to assess the appropriateness of low-level processing of a specific data set and the success of normalization for subsets of genes.
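
    The correlation criterion can be sketched directly: sample random gene pairs, compute their correlation across arrays, and check that the average is near zero. The one-sample t-test and the number of pairs used below are assumptions for this illustration rather than the paper's exact test.

```python
# Sketch of the random-pair correlation check: in well-normalized data,
# randomly paired genes should be roughly uncorrelated across arrays, so a
# mean correlation away from zero points to residual array-level effects.
# The one-sample t-test used here is a simplification for illustration.
import numpy as np
from scipy import stats

def correlation_test(expr, n_pairs=5000, seed=0):
    """expr: genes x arrays matrix of normalized log expression."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, expr.shape[0], n_pairs)
    j = rng.integers(0, expr.shape[0], n_pairs)
    keep = i != j                                        # drop self-pairs
    centered = expr - expr.mean(1, keepdims=True)
    sd = centered.std(1)
    r = (centered[i[keep]] * centered[j[keep]]).mean(1) / (sd[i[keep]] * sd[j[keep]])
    tstat, p = stats.ttest_1samp(r, 0.0)                 # mean correlation vs zero
    return r.mean(), p
```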